arXiv:1508.01585v2 [cs.CL] 2 Oct 2015 


APPLYING DEEP LEARNING TO ANSWER SELECTION: 
A STUDY AND AN OPEN TASK 


Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou 


IBM Watson 

Yorktown Heights, NY, USA, 10598 

<mfeng|bingxia|mrglass|wangli|zhou>@us.ibm.com 


ABSTRACT 

We apply a general deep learning framework to address the non-factoid question answering task. Our approach does not rely on any linguistic tools and can be applied to different languages or domains. Various architectures are presented and compared. We create and release a QA corpus and set up a new QA task in the insurance domain. Experimental results demonstrate superior performance compared to the baseline methods, and various techniques give further improvements. For this highly challenging task, the top-1 accuracy can reach up to 65.3% on a test set, which indicates great potential for practical use.

Index Terms — Answer Selection, Question Answering, 
Convolutional Neural Network (CNN), Deep Learning, Spoken Question Answering System

1. INTRODUCTION 

Natural language understanding based spoken dialog systems have been a popular topic in the recent renaissance of artificial intelligence. Many of those influential systems include a question answering module, e.g. Apple’s Siri, IBM’s Watson and Amazon’s Echo. In this paper, we address the Question Answering (QA) module in those spoken QA systems. We treat QA from a text matching and selection perspective. IBM’s Watson system [1] is a classical example of the traditional way of doing QA. In this work we utilize a deep learning framework to accomplish answer selection, which is a key step in the QA task. Given a question $q$ and an answer candidate pool $\{a_1, a_2, \ldots, a_s\}$ for that question ($s$ is the pool size), the goal is to find the best answer candidate $a_k$, $1 \le k \le s$. If the selected answer $a_k$ is inside the ground truth set of question $q$ (one question could have more than one correct answer), the question $q$ is considered to be answered correctly; otherwise it is answered incorrectly. From this definition, the QA problem can be regarded as a binary classification problem: for each question, each answer candidate is either appropriate or not. In order to find the best pair, we need a metric to measure the matching degree of each QA pair so that the QA pair with the highest metric value will be chosen.
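
The correctness criterion above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the evaluation script used for the paper: `score_fn` stands for any matching metric, and both function names are hypothetical.

```python
def select_answer(question, pool, score_fn):
    """Return the candidate in the pool with the highest matching score."""
    return max(pool, key=lambda answer: score_fn(question, answer))


def top1_accuracy(examples, score_fn):
    """examples: list of (question, candidate_pool, ground_truth_set).

    A question counts as answered correctly when the selected candidate
    is a member of its ground truth set.
    """
    correct = 0
    for question, pool, truth in examples:
        if select_answer(question, pool, score_fn) in truth:
            correct += 1
    return correct / len(examples)


def word_overlap(question, answer):
    """Toy stand-in metric (assumption): number of shared tokens."""
    return len(set(question.split()) & set(answer.split()))
```

Any scoring function with the same two-argument signature can be plugged in, which is how the learned similarity of Section 2 would be used at test time.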

The above definition is general. The only assumption made is that for every question there is an answer candidate pool. In practice, the pool can be easily built by using a general search engine like Google Search or an information retrieval software library like Apache Lucene.

We created a data set by collecting question and answer pairs from the internet. All these question and answer pairs are in the insurance domain. The construction of this insurance domain QA corpus was driven by the intense scientific and commercial interest in this domain. We released this corpus* to create an open QA task, enabling other researchers to utilize it and supporting a fair comparison among different methods. The corpus consists of four parts: train, development, test1 and test2. Table 1 gives the data statistics. All experiments conducted in this paper are based on this corpus. To the best of our knowledge, this is the first time an insurance domain QA task has been released.

Our QA task requires specifying an answer candidate pool for each question in the development, test1 and test2 parts of the corpus. The released corpus contains 24,981 unique answers in total. It is possible to use the whole answer space as the candidate pool, so that each question must be compared with 24,981 answer candidates. However, this is impractical because the computation is too time consuming. In this paper, we set the pool size to 500, so that the task is practical yet still challenging. We put the ground truth answers into the pool and randomly sample negative answers from the answer space until the pool size reaches 500.
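
A minimal sketch of this pool construction is given below. It assumes answers are identified by integer ids and that negatives are sampled without replacement; the function name `build_candidate_pool` and the fixed seed are illustrative choices, not part of the released corpus tools.

```python
import random


def build_candidate_pool(ground_truth_ids, all_answer_ids, pool_size=500, rng=None):
    """Return the ground truth ids plus random negatives, pool_size ids in total."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch reproducible
    truth = sorted(set(ground_truth_ids))
    truth_set = set(truth)
    # Negatives are all answers that are not ground truth for this question.
    negatives = [a for a in all_answer_ids if a not in truth_set]
    n_negatives = max(0, pool_size - len(truth))
    pool = truth + rng.sample(negatives, min(n_negatives, len(negatives)))
    rng.shuffle(pool)  # avoid leaking the position of the correct answers
    return pool
```

With the 24,981 answers of the corpus, a question with two correct answers would receive 498 sampled negatives.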

The technology described in this paper, together with the released data set and benchmark task, targets potential applications like online customer service. Hence it is not supposed to handle question answering tasks that require reasoning, e.g. “Is tomorrow Tuesday?” (the answer depends on whether today is Monday). The rest of the paper is organized as follows: Sec. 2 describes the different architectures used in this work; Sec. 3 provides the experimental setup details; experimental results and discussions are presented in Sec. 4; Sec. 5 contains related work

*git clone https://github.com/shuzi/insuranceQA.git 




          Questions    Answers    Question Word Count
Train     12,887       18,540     92,095
Dev       1,000        1,454      7,158
Test1     1,800        2,616      12,893
Test2     1,800        2,593      12,905

Table 1: Corpus statistics. The first two columns give the number of questions and answers; some questions have multiple answers, so the answer count is larger than the question count. The third column is the total word count of the questions. The total number of answers is 24,981 and the whole answer text contains 2,386,749 words.


and finally we draw conclusions in Sec. 6. 


2. MODEL DESCRIPTION 

In this section we describe the proposed deep learning framework and many variations based on that framework. The main idea of those different systems is the same: learn a distributed vector representation of a given question and its answer candidates and then use a similarity metric to measure the matching degree. We first developed two baseline systems for comparison.


2.1. Baseline Systems 

The first baseline system is a bag-of-words model. Step one is to train a word embedding using [2]. This word embedding provides the word vector for each token in the question and its candidate answer. From these, the baseline system produces the idf-weighted sum of word vectors for the question and for all of its answer candidates. This produces a vector representation for the question and each answer candidate. The last step is to calculate the cosine similarity between each question/candidate pair. The pair with the highest cosine similarity is returned as the answer.

The second baseline is an information retrieval (IR) baseline. The state-of-the-art weighted dependency model (WD) [3, 4] is used. The WD model employs a weighted combination of term-based and term proximity-based ranking features to score each candidate answer. Example features include counts of question bigrams in ordered and unordered windows of different sizes in each candidate answer, in addition to simple unigram counts. The basic idea is that important bigrams or unigrams in the question should receive higher weights when their frequencies are computed. Thus, the feature weights are assigned in accordance with the importance of the question unigrams or bigrams that they are defined over, where the importance factor is learned as part of the model training process. Rows 1 and 2 (first column Idx) of Table 2 are the baseline system results.
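
The bag-of-words baseline can be sketched as follows. This is a hedged illustration rather than the code behind Row 1 of Table 2: the idf variant log(N / df) is an assumption (the paper does not state which variant is used), the embedding is passed in as a plain dictionary, and all function names are hypothetical.

```python
import math

import numpy as np


def idf_weights(documents):
    """documents: list of token lists. Assumed idf variant: log(N / df)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for token in set(doc):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def sentence_vector(tokens, embeddings, idf):
    """idf-weighted sum of word vectors; unknown tokens are skipped."""
    dim = len(next(iter(embeddings.values())))
    vec = np.zeros(dim)
    for token in tokens:
        if token in embeddings:
            vec += idf.get(token, 0.0) * np.asarray(embeddings[token], dtype=float)
    return vec


def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def best_candidate(question, candidates, embeddings, idf):
    """Index of the candidate whose vector is most similar to the question."""
    q = sentence_vector(question, embeddings, idf)
    scores = [cosine(q, sentence_vector(c, embeddings, idf)) for c in candidates]
    return int(np.argmax(scores))
```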


2.2. CNNs-based System 

In this paper, a QA framework based on Convolutional Neural Networks (CNN) is presented. As summarized in Chapter 11 of [5], a CNN leverages three important ideas that can help improve a machine learning system: sparse interaction, parameter sharing and equivariant representation. Sparse interaction contrasts with traditional neural networks, where each output interacts with every input. In a CNN, the filter size (or kernel size) is usually much smaller than the input size. As a result, each output interacts only with a narrow window of the input. Parameter sharing refers to reusing the filter parameters across the convolution operations, whereas each element in the weight matrix of a traditional neural network is used only once to calculate the output. Equivariant representation is related to the idea of k-MaxPooling, which is usually combined with a CNN. In this paper we always set k = 1. Each filter of the CNN represents some feature, and after the convolution operation, the 1-MaxPooling value represents the highest degree to which the input contains that feature. The position of that feature in the input is irrelevant due to the convolution. This property is very useful for many NLP applications. Below is an example to demonstrate our CNN implementation.

$$
W = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{pmatrix}, \qquad
F = \begin{pmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \end{pmatrix} \qquad (1)
$$

The left matrix $W$ is the input sentence. Each word is represented by a 3-dimensional word embedding vector and the input length is 4. The right matrix $F$ represents the filter. The 2-gram filter size is 3 x 2. The convolution output of the input $W$ and the filter $F$ is a 3-dim vector $o$, assuming zero padding has been done so that only a narrow convolution is conducted.

$$
\begin{aligned}
o_1 &= w_{11}f_{11} + w_{12}f_{12} + w_{13}f_{13} + w_{21}f_{21} + w_{22}f_{22} + w_{23}f_{23} \\
o_2 &= w_{21}f_{11} + w_{22}f_{12} + w_{23}f_{13} + w_{31}f_{21} + w_{32}f_{22} + w_{33}f_{23} \\
o_3 &= w_{31}f_{11} + w_{32}f_{12} + w_{33}f_{13} + w_{41}f_{21} + w_{42}f_{22} + w_{43}f_{23}
\end{aligned} \qquad (2)
$$

After 1-MaxPooling, the maximum of the 3 values will be kept for the filter $F$, which indicates the highest degree to which filter $F$ matches the input $W$.
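
The worked example above can be checked numerically. The NumPy sketch below applies one bigram filter to a 4-word, 3-dimensional input and then takes the 1-MaxPooling value; the numeric entries of `W` and `F` are arbitrary illustrative values, and the function name is hypothetical.

```python
import numpy as np


def bigram_convolution(W, F):
    """W: (n_words, dim) sentence matrix; F: (2, dim) bigram filter.

    Returns the n_words - 1 outputs o_i = sum(W[i:i+2] * F), i.e. each
    output sees only a window of two adjacent words.
    """
    return np.array([np.sum(W[i:i + 2] * F) for i in range(W.shape[0] - 1)])


W = np.arange(1, 13, dtype=float).reshape(4, 3)      # 4 words, 3-dim embeddings
F = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]])    # one 2-gram filter
o = bigram_convolution(W, F)                         # [5.5, 10.0, 14.5]
feature = o.max()                                    # 1-MaxPooling keeps 14.5
```

The same filter weights are reused at every window position, which is the parameter sharing described in this section.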

Fig. 1: Architecture I. Q for question; A for answer; P is 1-MaxPooling; T is the tanh layer; HL for hidden layer, and HL already includes tanh as its activation function.

Fig. 2: Architecture II. QA means the weights of the corresponding layer are shared by Q and A.

2.3. Training and Loss Function

Different architectures will be described later. However, all of them share the same training and testing mechanism. In this paper we minimize a ranking loss similar to [6, 7]. During training, for each training question $Q$ there is a positive answer $A^+$ (the ground truth). A training instance is constructed by pairing this $A^+$ with a negative answer $A^-$ (a wrong answer) sampled from the whole answer space. The deep learning framework generates vector representations for the question and the two candidates: $V_Q$, $V_{A^+}$ and $V_{A^-}$. The cosine similarities $\cos(V_Q, V_{A^+})$ and $\cos(V_Q, V_{A^-})$ are calculated and the difference between the two similarities is compared to a margin $m$: $\cos(V_Q, V_{A^+}) - \cos(V_Q, V_{A^-}) < m$. When this condition is satisfied, the implication is that the vector space embedding either ranks the positive answer below the negative answer, or does not rank the positive answer sufficiently above the negative answer. If $\cos(V_Q, V_{A^+}) - \cos(V_Q, V_{A^-}) \ge m$, there is no update to the parameters and a new negative example is sampled until the difference is less than $m$ (to reduce running time we allow at most 50 samples in this paper). The hinge loss function is hence defined as follows:

$$
L = \max\{0,\; m - \cos(V_Q, V_{A^+}) + \cos(V_Q, V_{A^-})\} \qquad (3)
$$

For testing, we calculate $\cos(V_Q, V_{candidate})$ between the question $Q$ and each answer candidate in the pool (size 500). The candidate answer with the largest cosine similarity is selected.
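
The training rule of this subsection can be sketched as below. The code is a hedged illustration, not the Java implementation used in the paper: the vectors are assumed to be already produced by the network, and the names `hinge_loss` and `sample_violating_negative` are hypothetical.

```python
import random

import numpy as np


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def hinge_loss(v_q, v_pos, v_neg, margin):
    """Eq. (3): max{0, m - cos(V_Q, V_A+) + cos(V_Q, V_A-)}."""
    return max(0.0, margin - cosine(v_q, v_pos) + cosine(v_q, v_neg))


def sample_violating_negative(v_q, v_pos, negative_pool, margin, max_tries=50, rng=None):
    """Draw negatives until one violates the margin (loss > 0).

    Returns (negative_vector, loss), or None when no violating negative is
    found within max_tries draws, in which case no update is performed.
    """
    rng = rng or random.Random(0)
    for _ in range(max_tries):
        v_neg = rng.choice(negative_pool)
        loss = hinge_loss(v_q, v_pos, v_neg, margin)
        if loss > 0.0:
            return v_neg, loss
    return None
```

A positive loss is equivalent to the condition $\cos(V_Q, V_{A^+}) - \cos(V_Q, V_{A^-}) < m$, so only instances that violate the margin contribute a gradient.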

2.4. Architectures 

In this subsection we demonstrate several proposed architectures for this QA task. Figure 1 shows Architecture I. Q is the input question provided as input to the first hidden layer HL_Q. The hidden layer (HL) is defined as z = tanh(Wx + B), where W is the weight matrix, B is the bias vector, x is the input and z is the output of the activation function tanh. The output then flows to the CNN layer CNN_Q, which extracts question-side features. P is the MaxPooling layer (we always use 1-MaxPooling in this paper) and T is the tanh layer. Similar to the question side, the answer A is processed by HL_A and then features are extracted by CNN_A. 1-MaxPooling P and the tanh layer T function in the last step. The result is a vector representation for both question and answer. The final output is the cosine similarity between these vectors. Row 3 of Table 2 is the Architecture I result.

Fig. 3: Architecture III. HL for hidden layer. Add another HL_Q and HL_A after CNN_QA.

Fig. 4: Architecture IV. Add another shared hidden layer HL_QA after CNN_QA.

Figure 2 is Architecture II. The main difference compared to Architecture I is that both question and answer sides share the same HL and CNN weights. Row 4 of Table 2 is the Architecture II result.
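
As a rough illustration of the shared-weight design, the forward pass of Architecture II can be written in NumPy as follows. This is a sketch under assumptions, not the paper's implementation: the filter width of 2, the absence of a convolution bias, and the random initialization are all illustrative choices, and the class name `SharedEncoder` is hypothetical. The default sizes follow Row 4 of Table 2 and the 100-dimensional embedding of Section 3.

```python
import numpy as np


class SharedEncoder:
    """HL_QA -> CNN_QA (bigram filters) -> 1-MaxPooling -> tanh.

    The same instance encodes both the question and the answer, so every
    weight is shared between the two sides.
    """

    def __init__(self, emb_dim=100, hidden=200, n_filters=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (emb_dim, hidden))          # HL weights
        self.b = np.zeros(hidden)                                 # HL bias
        self.filters = rng.normal(0.0, 0.1, (n_filters, 2 * hidden))

    def encode(self, X):
        """X: (n_words, emb_dim) with n_words >= 2. Returns (n_filters,)."""
        H = np.tanh(X @ self.W + self.b)            # hidden layer, per word
        bigrams = np.hstack([H[:-1], H[1:]])        # adjacent word pairs
        conv = bigrams @ self.filters.T             # (n_words - 1, n_filters)
        return np.tanh(conv.max(axis=0))            # 1-MaxPooling, then tanh


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Because pooling removes the position axis, questions and answers of different lengths map to vectors of the same size and can be compared directly with `cosine`.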

We also consider architectures with a hidden layer after the CNN. Figure 3 is Architecture III, in which another HL_Q is added at the question side after the CNN and another HL_A is added at the answer side after the CNN. Row 5 of Table 2 is the Architecture III result. Architecture IV, shown in Figure 4, is similar except that the second HL of both question and answer share the same HL_QA weights. Rows 6 and 7 of Table 2 are the Architecture IV results.

Figure 5 is Architecture V, where two layers of CNN_QA are deployed. In Section 2.2 we show that the convolution output is a vector (3-dim in that example). This is true only for CNNs with a single filter. By applying multiple filters the result is a matrix. If 4 filters are utilized for the example in Section 2.2, the output is the following matrix:

$$
\begin{pmatrix} o_{11} & o_{12} & o_{13} \\ o_{21} & o_{22} & o_{23} \\ o_{31} & o_{32} & o_{33} \\ o_{41} & o_{42} & o_{43} \end{pmatrix}
$$

Each row represents the output of one filter and each column represents a bigram of the input. This matrix is the input to the next CNN_QA layer. For this second layer, every bigram is effectively one “word” and the previous filters’ output for that bigram is its word embedding. Row 11 of Table 2 is the result for Architecture V. Architecture VI in Figure 6 is similar




















































Idx   Dev    Test1   Test2   Description
1     31.9   32.1    32.2    Baseline: Bag-of-words
2     52.7   55.1    50.8    Baseline: Metzler-Bendersky IR model
3     44.2   41.7    39.5    Architecture I: HL_Q(200) HL_A(200) CNN_Q(1000) CNN_A(1000) 1-MaxPooling Tanh
4     58.2   57.8    53.6    Architecture II: HL_QA(200) CNN_QA(1000) 1-MaxPooling Tanh
5     36.1   33.6    32.7    Architecture III: HL_QA(200) CNN_QA(1000) HL_Q(1000) HL_A(1000) 1-MaxPooling Tanh
6     51.4   50.5    46.1    Architecture IV: HL_QA(200) CNN_QA(1000) HL_QA(1000) 1-MaxPooling Tanh
7     47.0   46.7    43.0    Architecture IV: HL_QA(200) CNN_QA(1000) HL_QA(500) 1-MaxPooling Tanh
8     60.6   59.2    55.1    Architecture II: HL_QA(200) CNN_QA(2000) 1-MaxPooling Tanh
9     61.5   61.3    57.8    Architecture II: HL_QA(200) CNN_QA(3000) 1-MaxPooling Tanh
10    61.8   62.8    59.2    Architecture II: HL_QA(200) CNN_QA(4000) 1-MaxPooling Tanh (best result in this table)
11    59.7   59.3    55.6    Architecture V: HL_QA(200) CNN_QA(1000) CNN_QA(1000) 1-MaxPooling Tanh
12    59.9   60.6    55.9    Architecture VI: HL_QA(200) CNN_QA(1000) CNN_QA(1000) 1-MaxPooling Tanh (2COST)
13    59.9   58.7    53.8    Architecture II: HL_QA(200) Augmented-CNN_QA(1000) 1-MaxPooling Tanh
14    60.0   60.3    54.3    Architecture II: HL_QA(200) Augmented-CNN_QA(2000) 1-MaxPooling Tanh
15    61.7   62.2    56.3    Architecture II: HL_QA(200) Augmented-CNN_QA(3000) 1-MaxPooling Tanh

Table 2: Experimental results. HL(200) means the hidden layer size is 200; CNN(1000) means 1000 filters are used. Top-1 precision on Dev, Test1 and Test2 is reported.


to Architecture V except that we utilize layer-wise supervision. After each CNN_QA layer there is 1-MaxPooling and a tanh layer so that the cost function can be calculated and back-propagation can be conducted. The result of Architecture VI is in row 12 of Table 2.

We have tried another three techniques to improve Architecture II in Figure 2. First, the CNN filter quantity has been increased; see rows 8, 9 and 10 of Table 2. Second, the convolution operation has been augmented to include skip-bigrams. Consider the example in Section 2.2: for the input and one filter in Eq. (1), the augmented convolution operation will produce not only Eq. (2) but also the following discontinuous convolution:

$$
\begin{aligned}
o_4 &= w_{11}f_{11} + w_{12}f_{12} + w_{13}f_{13} + w_{31}f_{21} + w_{32}f_{22} + w_{33}f_{23} \\
o_5 &= w_{21}f_{11} + w_{22}f_{12} + w_{23}f_{13} + w_{41}f_{21} + w_{42}f_{22} + w_{43}f_{23}
\end{aligned}
$$

The 1-MaxPooling will still be applied to get the largest value among $[o_1, o_2, o_3, o_4, o_5]$, so that this filter is automatically adapted to match a bigram or skip-bigram feature. Rows 13, 14 and 15 of Table 2 show the results. Third, we investigate the similarity metric. Until now, we have been using the cosine similarity, which is widely adopted for vector space models. However, is cosine similarity the best option for this task? Table 3 gives the results of the similarity metric study. Some metrics include hyperparameters, and experiments with various hyperparameter values have been conducted. We propose two novel metrics (GESD and AESD) which demonstrate superior performance.
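
The augmented (skip-bigram) convolution described above can be sketched in NumPy. The filter layout, with row 0 applied to the first word and row 1 to the second, follows Eq. (2); the function name and the numeric values are illustrative assumptions.

```python
import numpy as np


def augmented_bigram_convolution(W, F):
    """W: (n_words, dim) input; F: (2, dim) filter.

    Returns the adjacent-bigram outputs (words i, i+1) followed by the
    skip-bigram outputs (words i, i+2).
    """
    n = W.shape[0]
    adjacent = [W[i] @ F[0] + W[i + 1] @ F[1] for i in range(n - 1)]
    skip = [W[i] @ F[0] + W[i + 2] @ F[1] for i in range(n - 2)]
    return np.array(adjacent + skip)


W = np.arange(1, 13, dtype=float).reshape(4, 3)      # 4 words, 3-dim embeddings
F = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]])
o = augmented_bigram_convolution(W, F)               # o_1 .. o_5
feature = o.max()                                    # 1-MaxPooling over all five
```

For a 4-word input this yields three adjacent outputs and two skip outputs, matching $o_1, \ldots, o_5$ in the text.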

3. EXPERIMENTAL SETUP 

The deep learning framework in this paper has been built from scratch using Java. To improve speed, we adopt the HOGWILD approach [8]. Each thread processes one training instance at a time and updates the weights of the neural networks. There is no locking in any thread. The word embedding (100 dimensions) is trained by word2vec [2] and used for initialization. Word embeddings are also parameters and are optimized for the QA task. Stochastic Gradient Descent is the optimization strategy and an L2-norm term is added to the loss function. In this paper, the weight of the L2-norm is 0.0001, the learning rate is 0.01 and the margin m is 0.009. Those hyperparameters are chosen based on previous experience in using deep learning on this data and they are not very sensitive within a reasonable range. The computing resources utilized for this work are enormous. We heavily occupy a Power 7 cluster which consists of 75 machines. Each machine has 32 physical cores and each core supports 2-4 hyperthreads. The HOGWILD approach brings some randomness because there is no locking. Even with locking, the thread scheduler would alter the order of examples between runs, so that randomness would still exist. Therefore, for each row in Table 2 (except for rows 1 and 2) and Table 3, 10 experiments have been conducted on the dev set and the run with the best dev score is chosen to calculate the test scores.

Fig. 5: Architecture V. Two shared CNN_QA.

Fig. 6: Architecture VI. Two shared CNN_QA. Two cost functions.

Architecture II with 1000 filters:

Dev     Test1   Test2   Metric        Hyperparameters
58.2    57.8    53.6    cosine
58.5    57.1    53.3    polynomial    γ = 0.5, d = 2, c = 1
56.8    54.6    52.6    polynomial    γ = 1.0, d = 2, c = 1
55.0    53.6    48.2    polynomial    γ = 1.5, d = 2, c = 1
57.1    53.7    51.5    polynomial    γ = 0.5, d = 3, c = 1
55.3    52.4    48.7    polynomial    γ = 1.0, d = 3, c = 1
52.5    51.0    47.2    polynomial    γ = 1.5, d = 3, c = 1
61.3    59.9    57.0    sigmoid       γ = 0.5, c = 1
61.6    60.2    57.1    sigmoid       γ = 1.0, c = 1
60.2    60.2    55.7    sigmoid       γ = 1.5, c = 1
60.0    60.3    54.7    RBF           γ = 0.5
60.2    57.0    54.4    RBF           γ = 1.0
58.4    57.3    53.8    RBF           γ = 1.5
60.8    60.3    57.0    euclidean
42.2    42.5    38.2    exponential   γ = 0.5
41.4    39.5    36.0    exponential   γ = 1.0
48.2    45.1    41.6    exponential   γ = 1.5
51.0    49.5    46.4    manhattan
62.5    61.4    59.0    GESD          γ = 0.5, c = 1
62.9    62.1    59.3    GESD          γ = 1.0, c = 1
62.6    62.1    59.2    GESD          γ = 1.5, c = 1
63.1    61.9    58.2    AESD          γ = 0.5, c = 1
63.4    61.7    58.7    AESD          γ = 1.0, c = 1
62.8    62.0    57.7    AESD          γ = 1.5, c = 1

Architecture II with the proposed metrics and more filters:

Dev     Test1   Test2   Metric        Hyperparameters
63.5    62.5    60.2    GESD          γ = 1.0, 2000 filters
64.3    65.1    61.0    GESD          γ = 1.0, 3000 filters
65.4*   65.3*   61.0    GESD          γ = 1.0, 4000 filters
64.5    62.7    60.1    AESD          γ = 1.0, 2000 filters
64.3    63.3    62.2*   AESD          γ = 1.0, 3000 filters
63.9    64.5    61.1    AESD          γ = 1.0, 4000 filters

Metric definitions:

cosine:       $k(x, y) = \frac{xy^T}{\|x\| \, \|y\|}$
polynomial:   $k(x, y) = (\gamma xy^T + c)^d$
sigmoid:      $k(x, y) = \tanh(\gamma xy^T + c)$
RBF:          $k(x, y) = \exp(-\gamma \|x - y\|^2)$
euclidean:    $k(x, y) = \frac{1}{1 + \|x - y\|}$
exponential:  $k(x, y) = \exp(-\gamma \|x - y\|_1)$
manhattan:    $k(x, y) = \frac{1}{1 + \|x - y\|_1}$
GESD:         $k(x, y) = \frac{1}{1 + \|x - y\|} \cdot \frac{1}{1 + \exp(-\gamma (xy^T + c))}$
AESD:         $k(x, y) = \frac{0.5}{1 + \|x - y\|} + \frac{0.5}{1 + \exp(-\gamma (xy^T + c))}$

Table 3: Experimental results of various similarities. All results in the upper part are based on Architecture II with 1000 filters (corresponding to Row 4 in Table 2). In the bottom part, the results are based on Architecture II using the proposed metrics with more filters. $k(x, y)$ is the similarity between vectors $x$ and $y$. $\|x\|$ is the L2 norm and $\|x\|_1$ is the L1 norm; $xy^T$ represents the inner product of $x$ and $y$. We always normalize the question and answer vectors before calculating the similarity. The highest number in each column is marked with *.

4. RESULTS AND DISCUSSIONS 

In this section, a detailed analysis of the experimental results is given. From Tables 2 and 3 the following conclusions can be drawn:

(1) Baseline 1 only utilizes word embeddings and baseline 2 is based on traditional term-based features. Our proposed method reaches significantly better accuracy, which demonstrates the superiority of the deep learning approach.

(2) Using separate hidden layers (HL) or CNN layers for Q and A performs worse than a shared HL or CNN layer (Table 2, row 3 vs. 4, row 5 vs. 6). This is reasonable because for a shared-layer network, the corresponding elements in the Q and A vectors are guaranteed to represent the same CNN filter convolution result, while for a network with separate Q and A layers there is no such constraint and the optimizer has to learn twice as many parameters. Hence the optimizer faces greater difficulty.

(3) Adding an HL after the CNN degrades the performance (Table 2, row 4 vs. 6 and 7). This indicates that the CNN already captures useful features for QA matching and that mapping the features to another space is unnecessary.

(4) Increasing the CNN filter quantity captures more features, which gives a notable improvement (Table 2, row 4 vs. 8, 9 and 10).

(5) Two layers of CNN can represent a higher level of abstraction with a wider range in the input. Hence going deeper by using two CNN layers improves the accuracy (Table 2, row 4 vs. 11).

(6) Effective learning in deep networks is often a difficult task. Layer-wise supervision can alleviate the problem (Table 2, row 11 vs. 12).

(7) Combining bigram and skip-bigram features brings gain on Test1 but not on Test2 (Table 2, row 4 vs. 13, row 8 vs. 14, row 9 vs. 15).

(8) Table 3 shows that with the same model capacity, the similarity metric plays an important role and the widely used cosine similarity is not the best choice for this task. The similarities in Table 3 can be categorized into three classes: L1-norm based metrics, which measure the semantic distance of Q and A summed over each coordinate axis; L2-norm based metrics, which measure the straight-line semantic distance of Q and A; and inner product based metrics, which measure the angle between Q and A. We propose two new metrics that combine the L2-norm and the inner product by multiplication (GESD, Geometric mean of Euclidean and Sigmoid Dot product) and addition (AESD, Arithmetic mean of Euclidean and Sigmoid Dot product). The two proposed metrics are the best among all compared metrics. Finally, the bottom of Table 3 makes clear that with more filters, the proposed metrics can achieve even better performance.
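
The two proposed metrics can be written compactly in NumPy. The sketch below follows the definitions in Table 3 and normalizes both vectors first, as stated in the table caption; the function names and the default values γ = 1.0, c = 1 are illustrative choices taken from the reported settings.

```python
import numpy as np


def _normalize(v):
    return v / np.linalg.norm(v)


def _parts(x, y, gamma, c):
    """Euclidean term 1/(1+||x-y||) and sigmoid-dot term 1/(1+exp(-gamma(xy^T+c)))."""
    x, y = _normalize(x), _normalize(y)
    euclid = 1.0 / (1.0 + np.linalg.norm(x - y))
    sigmoid_dot = 1.0 / (1.0 + np.exp(-gamma * (x @ y + c)))
    return euclid, sigmoid_dot


def gesd(x, y, gamma=1.0, c=1.0):
    """GESD: product of the Euclidean and sigmoid-dot terms."""
    euclid, sigmoid_dot = _parts(x, y, gamma, c)
    return float(euclid * sigmoid_dot)


def aesd(x, y, gamma=1.0, c=1.0):
    """AESD: equally weighted sum of the Euclidean and sigmoid-dot terms."""
    euclid, sigmoid_dot = _parts(x, y, gamma, c)
    return float(0.5 * euclid + 0.5 * sigmoid_dot)
```

Both functions return larger values for more similar vectors, so either can replace the cosine similarity in the ranking loss and in test-time candidate selection.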

5. RELATED WORK 

Deep learning technology has been widely used in machine learning tasks, often demonstrating superior performance compared to traditional methods. Many of those applications focus on classification related tasks, e.g. on image recognition [9], on speech [10, 11, 12] and on machine translation [13, 14]. This paper is based on many prior works on utilizing deep learning for NLP tasks: Gao et al. [15] proposed a CNN based network which maps source-target document pairs to embedding vectors such that the distance between source documents and their corresponding interesting targets is minimized; Lu and Li [16] propose a CNN based deep network for a short text matching task; Hu et al. [7] also use several CNN based networks for sentence matching; Kalchbrenner et al. [17] use a CNN for sentiment prediction and question classification; Kim [18] uses a CNN in sentiment analysis; Zeng et al. [19] use a CNN for relation classification; Socher et al. [20, 21] use a recursive network for paraphrase detection and parsing; Iyyer et al. [22] propose a recursive network for factoid question answering; Weston et al. [6] use a CNN for hashtag prediction; Yu et al. [23] use a CNN for answer selection; Yin and Schütze [24] use a bi-CNN for paraphrase identification. Our work follows the spirit of many previous works in the sense that we utilize a CNN to map natural language sentences into embedding vectors so that the similarity can be calculated. However, this paper conducts extensive experiments over various architectures which are not included in previous work. Furthermore, we explore different similarity metrics, skip-bigram based convolution and layer-wise supervision, which have not been presented in previous work.

6. CONCLUSIONS 

In this paper, the spoken question answering system is studied from an answer selection perspective by employing a deep learning framework. The proposed framework does not rely on any linguistic tool and can be easily adapted to different languages or domains. Our work serves as solid evidence that deep learning based QA is an encouraging research direction. The scientific contributions can be summarized as follows: (1) creating a new QA task in the insurance domain and releasing a new corpus so that different methods can be fairly compared; (2) proposing a general deep learning framework with several variants for the QA task and conducting comparison experiments; (3) utilizing novel techniques that bring improvements: multi-layer CNN with layer-wise supervision, augmented CNN with discontinuous convolution and novel similarity metrics that combine both L2-norm and inner product information; (4) the best scores in this paper are very promising: for this challenging task (selecting one answer from a pool of size 500), the top-1 accuracy on the test corpus can reach up to 65.3%; (5) for researchers who want to proceed with this task, this paper provides valuable guidance: a shared layer structure should be adopted; there is no need to append a hidden layer after the CNN; two levels of CNN with layer-wise training improve accuracy; discontinuous convolution can sometimes help; the similarity metric plays a crucial role and the proposed metrics are preferred; and finally, increasing the filter quantity brings improvement.